weekdays - 2 levels: Workday; Weekend
SumWin - 4 levels: Winter2; Summer1; Summer2; Winter1
aftOpen - how many days after the open date
toClose - how many days to the close date
HolBef - the day before is a holiday
HolAft - the day after is a holiday
tempLev - 2 levels: Hot (>40); Cold (<=40)
prev1Prof - the sum of profit on the previous 1 day
prev5Prof - the sum of profit in the previous 5 days
prev1Temp - the sum of temperature on the previous 1 day
prev5Temp - the sum of temperature on the previous 5 days
DofNoProfBef - how many consecutive days of no profit before the current day
prevDRain - did it rain the day before
DofRainfBef - how many consecutive days of rain before the current day

| location_id | PredictedProfit |
|---|---|
| 11 | 17716 |
| 12 | 17294 |
| 13 | 16061 |
Since the data are periodic, we could also try an AR model.
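As a hedged sketch of that idea, here is how an AR fit and a short-horizon forecast might look. The series below is simulated (a noisy sine wave), not the case-study profits:

```r
# Sketch: fit an autoregressive model to a periodic series and forecast ahead.
set.seed(1)
y <- ts(sin(2 * pi * (1:400) / 365) * 50 + rnorm(400, sd = 5), frequency = 365)
fit <- ar(y, order.max = 10)       # Yule-Walker AR fit, order selected by AIC
fc  <- predict(fit, n.ahead = 30)  # 30-day-ahead forecast
head(fc$pred)
```

On the real profit series, the closure gaps would first need to be handled (e.g. modeled only within open periods).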
For a study on real data, I might also include financial features. When setting up a new location, we should also consider nearby competitors. And in the beginning, before customers are familiar with the company, profit will likely be much lower than at established locations.
When I divide features into different levels, I could use clustering methods.
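For illustration, a minimal sketch of that idea with base-R k-means on synthetic temperatures (the data are made up, not taken from this study):

```r
# Sketch: instead of hand-picking a cut point (e.g. temperature > 40),
# let k-means find the groups and use the cluster label as the feature level.
set.seed(1)
temp <- c(rnorm(100, mean = 25, sd = 5), rnorm(100, mean = 60, sd = 5))
km <- kmeans(temp, centers = 2)
tempLev <- factor(km$cluster)   # cluster label becomes the two-level feature
table(tempLev)
```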
When filling missing values, I could fit a model (regression/classification) to predict them.
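A minimal sketch of model-based imputation on toy data (variable names here are illustrative, not from the case-study tables):

```r
# Sketch: impute a missing numeric column by regressing it on an observed one.
set.seed(1)
d <- data.frame(x = runif(100))
d$y <- 2 * d$x + rnorm(100, sd = 0.1)
d$y[sample(100, 10)] <- NA                 # knock out 10 values
fit <- lm(y ~ x, data = d)                 # lm drops NA rows by default
miss <- is.na(d$y)
d$y[miss] <- predict(fit, newdata = d[miss, , drop = FALSE])
sum(is.na(d$y))
```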
If the same transaction_id means the same customer, we could add this information to our model to make a better prediction.
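One hedged way to use that, assuming transaction_id really identifies a customer, is a per-day repeat-customer count as a feature. A toy table with the same column names stands in for the real data:

```r
# Sketch: derive a daily repeat-customer count alongside the daily profit sum.
library(dplyr)
toy <- tibble(date = as.Date("2020-01-01") + c(0, 0, 1),
              transaction_id = c(1, 1, 2),
              profit = c(5, 7, 9))
res <- toy %>%
  group_by(date) %>%
  summarise(profit  = sum(profit),
            nRepeat = sum(duplicated(transaction_id)))  # repeat visits that day
res
```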
Hol <- read.csv("data/holiday_data.csv") %>% as_tibble()
Loc <- read.csv("data/location_data.csv") %>% as_tibble()
Wet <- read.csv("data/weather_data.csv") %>% as_tibble()
Tr1 <- fromJSON(file = "data/transactions_001_system3.json")
Tr2 <- fromJSON(file = "data/transactions_002_system3.json")
Tr3 <- read.table("data/transactions_003_system2.txt", header = T)
Tr4 <- read.table("data/transactions_004_system2.txt", header = T)
Tr5 <- fromJSON(file = "data/transactions_005_system3.json")
Tr6 <- read.table("data/transactions_006_system2.txt", header = T)
Tr7 <- read.table("data/transactions_007_system2.txt", header = T)
Tr8 <- read.csv("data/transactions_008_system1.csv")
Tr9 <- read.table("data/transactions_009_system2.txt", header = T)
Tr10 <- fromJSON(file = "data/transactions_010_system3.json")
Transform the data type for each column
# Convert a list parsed from JSON into a typed data frame,
# filling absent fields with NA
transJSON <- function(table){
Tab <- table %>%
sapply(function(l)
c(ifelse(is.null(l$location_id), NA, l$location_id),
ifelse(is.null(l$date), NA, l$date),
ifelse(is.null(l$transaction_id), NA, l$transaction_id),
ifelse(is.null(l$profit), NA, l$profit))) %>%
data.frame(row.names = c("location_id", "date",
"transaction_id", "profit")) %>%
t %>% as.data.frame() %>%
mutate(location_id = as.integer(location_id),
date = as.Date(date),
transaction_id = as.integer(transaction_id),
profit = as.numeric(profit))
return(Tab)
}
Tr1 <- Tr1 %>% transJSON %>% as_tibble()
Tr2 <- Tr2 %>% transJSON %>% as_tibble()
Tr5 <- Tr5 %>% transJSON %>% as_tibble()
Tr10 <- Tr10 %>% transJSON %>% as_tibble()
# Parse the system2 text tables: dates are "%m-%d-%Y" and the profit
# field keeps only the part after the last "-"
transtxt <- function(table){
Tab <- table %>%
mutate(location_id = as.numeric(location_id),
date = as.Date(date, tryFormats = c("%m-%d-%Y")),
transaction_id = as.numeric(transaction_id),
profit = profit %>% strsplit("-") %>%
sapply(function(l)l[length(l)]) %>%
as.numeric()) %>% as_tibble()
return(Tab)
}
Tr3 <- Tr3 %>% transtxt()
Tr4 <- Tr4 %>% transtxt()
Tr6 <- Tr6 %>% transtxt()
Tr7 <- Tr7 %>% transtxt()
Tr9 <- Tr9 %>% transtxt()
Tr8 <- Tr8 %>% mutate(location_id = as.integer(location_id),
date = as.Date(date, tryFormats = c("%m/%d/%Y")),
transaction_id = as.integer(transaction_id),
profit = profit %>% substring(2) %>% as.numeric()) %>%
as_tibble()
Deal with Missing Data in Tables transactions_001 - 010
for (i in 1:10) {
print(get(paste0("Tr",i)) %>% is.na() %>% colSums())
}
## location_id date transaction_id profit
## 0 0 0 2
## location_id date transaction_id profit
## 0 0 0 3
## location_id date transaction_id profit
## 0 0 0 1
## location_id date transaction_id profit
## 0 0 0 6
## location_id date transaction_id profit
## 0 0 0 5
## location_id date transaction_id profit
## 0 0 0 3
## location_id date transaction_id profit
## 0 0 0 6
## location_id date transaction_id profit
## 0 0 0 0
## location_id date transaction_id profit
## 0 0 0 0
## location_id date transaction_id profit
## 0 0 0 1
In most of these 10 tables, missing values occur only in the final column, profit.
Before doing anything, I should clarify why the data are missing. If missing means the profit was 0, I can simply remove those rows or fill them with 0. If it means the record itself is missing, deleting the row would lose information; in that case, for more accurate values, we could fill each missing value with the mean profit of the corresponding day. I'm not sure whether the same transaction_id denotes the same customer; if it does, I could also average the profit for that transaction_id on the previous and next days, when they exist. A more complex approach is k-nearest neighbors or fitting a regression to 'predict' the missing values. We could even use the EM algorithm if the missing values are considered extremely important.
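As a sketch of the "mean profit on the corresponding day" option, here it is on a toy table shaped like the transaction data (the values are invented):

```r
# Sketch: fill each missing profit with the mean profit of the same day.
library(dplyr)
toy <- tibble(location_id = 1,
              date = as.Date("2020-01-01") + c(0, 0, 1, 1),
              profit = c(10, NA, 20, 30))
imputed <- toy %>%
  group_by(date) %>%
  mutate(profit = ifelse(is.na(profit),
                         mean(profit, na.rm = TRUE), profit)) %>%
  ungroup()
imputed
```

This only works when the day has at least one observed transaction; as shown below, that is not the case here, which is why the rows end up deleted instead.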
After running the following code:
for (i in 1:10) {
print(get(paste0("Tr",i)) %>%
(function(t)
t[t$date %in% (t$date[t$profit %>% is.na()]),]))
}
## # A tibble: 2 × 4
## location_id date transaction_id profit
## <int> <date> <int> <dbl>
## 1 1 2020-02-07 1 NA
## 2 1 2020-02-14 1 NA
## # A tibble: 3 × 4
## location_id date transaction_id profit
## <int> <date> <int> <dbl>
## 1 2 2020-05-29 1 NA
## 2 2 2020-05-30 1 NA
## 3 2 2020-06-15 1 NA
## # A tibble: 1 × 4
## location_id date transaction_id profit
## <dbl> <date> <dbl> <dbl>
## 1 3 2020-02-10 1 NA
## # A tibble: 6 × 4
## location_id date transaction_id profit
## <dbl> <date> <dbl> <dbl>
## 1 4 2020-02-07 1 NA
## 2 4 2020-02-14 1 NA
## 3 4 2020-05-28 1 NA
## 4 4 2020-05-30 1 NA
## 5 4 2021-02-05 1 NA
## 6 4 2021-02-17 1 NA
## # A tibble: 5 × 4
## location_id date transaction_id profit
## <int> <date> <int> <dbl>
## 1 5 2019-02-26 1 NA
## 2 5 2020-02-17 1 NA
## 3 5 2020-05-28 1 NA
## 4 5 2020-06-01 1 NA
## 5 5 2021-02-03 1 NA
## # A tibble: 3 × 4
## location_id date transaction_id profit
## <dbl> <date> <dbl> <dbl>
## 1 6 2020-02-05 1 NA
## 2 6 2020-05-28 1 NA
## 3 6 2021-02-10 1 NA
## # A tibble: 6 × 4
## location_id date transaction_id profit
## <dbl> <date> <dbl> <dbl>
## 1 7 2020-02-17 1 NA
## 2 7 2020-05-28 1 NA
## 3 7 2020-05-29 1 NA
## 4 7 2021-02-15 1 NA
## 5 7 2021-02-17 1 NA
## 6 7 2021-02-18 1 NA
## # A tibble: 0 × 4
## # … with 4 variables: location_id <int>, date <date>, transaction_id <int>,
## # profit <dbl>
## # A tibble: 0 × 4
## # … with 4 variables: location_id <dbl>, date <date>, transaction_id <dbl>,
## # profit <dbl>
## # A tibble: 1 × 4
## location_id date transaction_id profit
## <int> <date> <int> <dbl>
## 1 10 2021-02-02 1 NA
For every missing value, there were no other records on that day at that location. Thus, I assume the missing values are erroneous records and simply delete those rows.
for (i in 1:10) {
assign(paste0("Tr",i),
get(paste0("Tr",i)) %>% na.omit())
}
# Sum profits from diff. transactions in the same day.
trans1day <- function(table){
table <- table %>% group_by(date) %>%
summarise(location_id = unique(location_id),
profit = sum(profit),
n=n())
return(table)
}
for (i in 1:10) {
assign(paste0("Tr", i), get(paste0("Tr", i)) %>% trans1day %>%
subset(select=c(2,1,3,4)))
}
# Combine data from different locations into one table
Trs <- rbind(Tr1, Tr2, Tr3, Tr4, Tr5,
Tr6, Tr7, Tr8, Tr9, Tr10)
# Combine the location and weather information with the `Trs` table
# Trs <- left_join(Trs, Loc, by = "location_id")
Trs <- expand.grid(location_id=1:13,
date = seq(from = as.Date("2019-01-01"),
to = as.Date("2022-10-30"),
by = "day")) %>%
as_tibble() %>%
left_join(Trs, by = c("location_id", "date")) %>%
left_join(Loc, by = "location_id")
# Add a column to describe the year of the date
Trs$year <- as.numeric(format(Trs$date,"%Y"))
# Add a column to describe the Month and Day
Trs$md <- as.character(format(Trs$date,"%m-%d"))
# Add a column to describe the Weekday
Trs$weekdays <-
factor(format(Trs$date,"%a"),
levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
# Change the Date Type of "date" in Holiday table and Weather table
Hol$date <- Hol$date %>% as.Date()
Wet$date <- Wet$date %>% as.Date()
# Make sure there is only one weather data in each day at each location
Wet %>% group_by(location_id, date) %>%
summarise(n=n()) %>% `$`(n) %>% `==`(1) %>% all()
## `summarise()` has grouped output by 'location_id'. You can override using the
## `.groups` argument.
## [1] TRUE
Trs table

Check the distribution of the response variable:
Trs$profit %>% hist
The response variable looks roughly normally distributed, so I will use a linear regression model rather than a generalized linear model.
Plot the date vs. profit:
Trs %>% filter(location_id<=10) %>%
ggplot(aes(x = date, y = profit))+
geom_point()+
facet_grid(location_id~.)
It looks like the rental locations close for a period each year; that is why there are no records then. I will treat those days as missing values instead of assigning them a profit of 0.
Also, I checked whether all locations closed on the same days:
Trs %>% filter(location_id<=10) %>%
group_by(date) %>%
summarise(n=sum(!is.na(profit))) %>% `$`(n) %>% table
## .
## 0 1 2 3 4 5 6 7 8 9 10
## 572 38 15 13 4 4 2 5 11 13 722
The answer is no, so I cannot assume a new location will shut down on the same days. Therefore, I will also predict profit for the candidate locations on days when they might be closed.
The marginal relationship between weekdays and profit
Trs %>% ggplot(aes(x=profit)) +
geom_histogram(bins = 40)+
facet_grid(rows = as.factor(Trs$weekdays),
scales = "free_y")
The distribution of profit is similar across the individual weekdays, and Saturday and Sunday resemble each other, but workdays differ from weekends. So I divide weekdays into two levels: Workday and Weekend.
Trs <- Trs %>% mutate(weekdays = ifelse(weekdays %in% c("Sat", "Sun"), "Weekend", "Workday"))
(Trs %>% filter(location_id<=10) %>%
ggplot(aes(x = as.Date(md, format = "%m-%d"),
y = profit,
color = as.factor(location_id))) +
geom_point(alpha=0.5)+
geom_vline(xintercept = as.Date(c("2022-05-01", "2022-10-01")))+
labs(x = "Date")+
scale_x_continuous(breaks = as.Date(c("2022-01-01", "2022-04-01",
"2022-07-01", "2022-10-01",
"2023-01-01")),
labels = c("01-01", "04-01",
"07-01", "10-01",
"01-01"))) %>%
ggplotly
I will add a feature distinguishing summer from winter, since the profit patterns in the two seasons differ.
Trs$SumWin <- Trs$date %>% format("%m") %>%
as.numeric() %>% (function(n) ifelse(n<5, 1,
ifelse(n<7, 2,
ifelse(n<10, 3, 4)))) %>%
factor(labels = c("Winter2", "Summer1", "Summer2", "Winter1"))
Obviously, profits increase when rental locations open after a long closure and decrease for a period before the next closure. I could add features to capture this, for example how many days since the last opening and how many days until the next closing. But I don't know the closure schedule for the new rental locations, so I will use the earliest observed open date and the latest observed close date as the open and close days.
Trs %>% filter(!is.na(Trs$profit),
as.numeric(format(date, "%m"))<5) %>%
`$`(date) %>% max()
## [1] "2022-03-18"
Trs %>% filter(!is.na(Trs$profit),
as.numeric(format(date, "%m"))>=5) %>%
`$`(date) %>% min()
## [1] "2019-05-26"
Trs %>% filter(!is.na(Trs$profit),
as.numeric(format(date, "%m"))<10) %>%
`$`(date) %>% max()
## [1] "2022-09-11"
Trs %>% filter(!is.na(Trs$profit),
as.numeric(format(date, "%m"))>=10) %>%
`$`(date) %>% min()
## [1] "2019-11-29"
Trs$aftOpen <- 0
Trs$toClose <- 0
Trs$toClose[Trs$SumWin=="Winter2"] <-
Trs %>% filter(SumWin=="Winter2") %>%
mutate(close = as.Date(paste0(year, "-03-18")),
toClose = ifelse(close-date<0,0,close-date)) %>%
`$`(toClose)
Trs$toClose[Trs$SumWin=="Summer2"] <-
Trs %>% filter(SumWin=="Summer2") %>%
mutate(close = as.Date(paste0(year, "-09-11")),
toClose = ifelse(close-date<0,0,close-date)) %>%
`$`(toClose)
Trs$aftOpen[Trs$SumWin=="Winter1"] <-
Trs %>% filter(SumWin=="Winter1") %>%
mutate(open = as.Date(paste0(year, "-11-29")),
aftOpen = ifelse(date-open<0,0,date-open)) %>%
`$`(aftOpen)
Trs$aftOpen[Trs$SumWin=="Summer1"] <-
Trs %>% filter(SumWin=="Summer1") %>%
mutate(open = as.Date(paste0(year, "-05-26")),
aftOpen = ifelse(date-open<0,0,date-open)) %>%
`$`(aftOpen)
I'm considering the impact of holidays on profit. The following plots check whether the days before and after a holiday differ significantly from other days.
Trs %>% ggplot(aes(profit)) +
geom_histogram(bins = 40) +
facet_grid(rows = (Trs$date-1) %in% Hol$date, scales = "free_y")
Trs %>% ggplot(aes(profit)) +
geom_histogram(bins = 40) +
facet_grid(rows = (Trs$date+1) %in% Hol$date, scales = "free_y")
From these two histograms, I believe there is a significant difference, so I add new features to capture this information.
Trs$HolBef <- (Trs$date-1) %in% Hol$date
Trs$HolAft <- (Trs$date+1) %in% Hol$date
According to the introduction of this case study, a small amount of weather data is missing. I want to fill in those missing weather values first.
Wet <- expand.grid(location_id=1:13,
date = seq(from = as.Date("2019-01-01"),
to = as.Date("2022-10-30"),
by = "day")) %>%
left_join(Wet, by = c("location_id", "date")) %>%
as_tibble()
# Wet <- Wet %>% left_join(Loc, by = c("location_id"))
Plot date vs. temperature:
Wet %>% ggplot(aes(x = date, y = temperature)) + geom_point() + facet_grid(rows = Wet$location_id)
This is periodic data. We could use a general periodic model or an autoregressive model to predict the missing temperatures, but I will use a simpler approach: impute each missing value with the mean temperature of the neighboring days (within two days on either side) at the same location.
for (i in which(is.na(Wet$temperature))) {
Wet$temperature[i] <-
mean(Wet$temperature[Wet$date>(Wet$date[i]-3) &
Wet$date<(Wet$date[i]+3) &
Wet$location_id==Wet$location_id[i]], na.rm = T)
}
Plot temperature vs. profit
Trs %>% left_join(Wet, by = c("location_id", "date")) %>%
filter(location_id<=10) %>%
ggplot(aes(temperature, profit))+
geom_point()+geom_vline(aes(xintercept=40))+
facet_wrap(~location_id)
Clearly there are two groups with different trends, and there are no records around temperature = 40. I will add a new feature dividing temperature into two levels.
Wet$tempLev <- ifelse(Wet$temperature>40, "Hot", "Cold") %>% factor()
Plot date vs. pressure:
Wet %>% ggplot(aes(x = date, y = pressure, color = temperature)) +
geom_point() + facet_grid(rows = Wet$location_id)
There is no obvious pattern in these plots, so I will fill the missing pressure values with the mean of all observed pressures.
Wet <- Wet %>% mutate(pressure = ifelse(is.na(pressure), mean(Wet$pressure, na.rm = T), pressure))
Plot pressure vs. profit
Trs %>% left_join(Wet, by = c("location_id", "date")) %>%
filter(location_id<=10) %>%
ggplot(aes(pressure, profit))+geom_point()+
facet_wrap(~location_id)
From these plots, I don't think pressure is a significant feature for predicting profit, but I will still include it in the model.
Draw the histogram of the humidity data. The distribution looks a little strange: the lower the humidity, the fewer the days, except at 0.
Wet[order(Wet$date),] %>% filter(location_id == 1) %>% `$`(humidity) %>% hist(breaks=50)
I prefer to treat those 0s as missing values (I am not sure this is correct and would need to check the source of the data).
Wet <- Wet %>% mutate(humidity = ifelse(humidity==0, NA, humidity))
Plot date vs. humidity:
Wet %>% ggplot(aes(x = date, y = humidity)) +
geom_point() + facet_grid(rows = Wet$location_id)
The distribution of humidity looks more complex. To impute the missing data more accurately, I could build a model such as:
\[
\begin{aligned}
\text{Humidity} &\sim \text{Unif}(a, 1)\\
a &= \beta_0 + \beta_1\,\text{Temperature} + \epsilon\\
\epsilon &\sim N(\mu, \sigma)
\end{aligned}
\]
I won't do that here because it is not the goal of this case study; I use the same simple neighboring-days imputation again.
for (i in which(is.na(Wet$humidity))) {
Wet$humidity[i] <-
mean(Wet$humidity[Wet$date>(Wet$date[i]-3) &
Wet$date<(Wet$date[i]+3) &
Wet$location_id==Wet$location_id[i]], na.rm = T)
}
Plot humidity vs. profit
Trs %>% left_join(Wet, by = c("location_id", "date")) %>%
filter(location_id<=10) %>%
ggplot(aes(humidity, profit))+geom_point()+
facet_wrap(~location_id)
The conclusion is the same as for pressure.
Since cloudy and precipitation are categorical, I could build a logistic regression model or use a machine-learning classifier, for example random forest or XGBoost, to impute them. But that is a big topic, and the imputed values would be uncertain. So instead I will add a new level, 'Unknow', to these two variables; in the model, this level is effectively treated as an average.
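For reference, a hedged sketch of the classification-imputation idea on synthetic weather-like data (the column names mirror the real tables, the values and model do not; this is not the approach used here):

```r
# Sketch: classify missing cloudy flags from temperature and humidity
# with logistic regression fitted on the observed rows.
set.seed(1)
w <- data.frame(temperature = runif(200, 0, 80),
                humidity    = runif(200, 0, 1))
w$cloudy <- rbinom(200, 1, plogis(2 * w$humidity - w$temperature / 40))
miss <- sample(200, 20)                       # pretend these flags are missing
obs  <- w[-miss, ]
fit  <- glm(cloudy ~ temperature + humidity, family = binomial, data = obs)
w$cloudy[miss] <- as.integer(
  predict(fit, w[miss, ], type = "response") > 0.5)
table(w$cloudy)
```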
Wet <- Wet %>% mutate(cloudy=ifelse(is.na(cloudy),
"Unknow", cloudy),
precipitation=ifelse(is.na(precipitation),
"Unknow", precipitation)) %>%
mutate(cloudy=factor(cloudy, levels = c("Unknow", "FALSE", "TRUE"),
labels = c("Unknow", "No", "Yes")),
precipitation=factor(precipitation,
levels = c("Unknow", "FALSE", "TRUE"),
labels = c("Unknow", "No", "Yes")))
cloudy vs profit
Trs %>% left_join(Wet, by = c("location_id", "date")) %>%
ggplot(aes(profit))+geom_histogram(bins = 40)+
facet_wrap(~cloudy, scales = "free_y")
precipitation vs profit
Trs %>% left_join(Wet, by = c("location_id", "date")) %>%
ggplot(aes(profit))+geom_histogram(bins = 40)+
facet_wrap(~precipitation, scales = "free_y")
Trs %>% subset(select = c(location_id, date, profit)) %>%
left_join(Wet, by = c("location_id", "date")) %>%
subset(select=-c(location_id, date, tempLev)) %>%
ggpairs(ggplot2::aes(colour=Wet$tempLev, alpha=0.5))
Combine all tables:
FullD <- Trs %>% left_join(Wet, by = c("location_id", "date"))
Add more features:
prev1Prof - the sum of profit on the previous 1 day
prev5Prof - the sum of profit in the previous 5 days
prev1Temp - the sum of temperature on the previous 1 day
prev5Temp - the sum of temperature on the previous 5 days
DofNoProfBef - how many consecutive days of no profit before the current day; 3 levels: 0; 1-4; >=5
prevDRain - did it rain the day before; 3 levels: Unknow; No; Yes
DofRainfBef - how many consecutive days of rain before the current day; 3 levels: 0; 1-2; >=3
FullD$prev1Prof <- NA
FullD$prev5Prof <- NA
FullD$prev1Temp <- NA
FullD$prev5Temp <- NA
FullD$DofNoProfBef <- 0
# Precipitation on the previous day at the same location
# ("Unknow" when unavailable); a plain one-row shift would mix locations,
# so look up the previous date per location instead
FullD$prevDRain <- FullD %>%
transmute(location_id, date = date - 1) %>%
left_join(Wet %>% select(location_id, date, precipitation),
by = c("location_id", "date")) %>%
`$`(precipitation)
FullD$prevDRain[is.na(FullD$prevDRain)] <- "Unknow"
FullD$DofRainfBef <- 0
FullD$DofRainfBef[FullD$prevDRain=="Yes"] = 1
for (i in 1:nrow(FullD)) {
FullD$prev1Prof[i] <- sum(FullD$profit[FullD$location_id==FullD$location_id[i] &
FullD$date==(FullD$date[i]-1)], na.rm = T)
FullD$prev5Prof[i] <- sum(FullD$profit[FullD$location_id==FullD$location_id[i] &
FullD$date>(FullD$date[i]-6) &
FullD$date<FullD$date[i]], na.rm = T)
FullD$prev1Temp[i] <- sum(Wet$temperature[Wet$location_id == FullD$location_id[i] &
Wet$date==(FullD$date[i]-1)])
FullD$prev5Temp[i] <- sum(Wet$temperature[Wet$location_id==FullD$location_id[i] &
Wet$date>(FullD$date[i]-6) &
Wet$date<FullD$date[i]])
if((Wet$precipitation[Wet$location_id==FullD$location_id[i] &
Wet$date>(FullD$date[i]-4) &
Wet$date<FullD$date[i]] == "Yes") %>%
sum() %>% `==`(3)){
FullD$DofRainfBef[i] <- 3
}
}
FullD$DofRainfBef <- factor(FullD$DofRainfBef, levels = c(0,1,3),
labels = c("0", "1-2", ">=3"))
for (i in which(!FullD$date %in% (FullD$date+1))) {
FullD$DofNoProfBef[i] = 1
if((FullD$location_id==FullD$location_id[i] &
FullD$date>(FullD$date[i]-6) &
FullD$date<FullD$date[i]) %>% sum %>% `==`(0)){
FullD$DofNoProfBef[i] = 5
}
}
FullD$DofNoProfBef <- factor(FullD$DofNoProfBef, levels = c(0,1,5),
labels = c("0", "1-4", ">=5"))
FullD <- FullD %>% filter(date>(as.Date("2019-01-01")+4))
As discussed before, I will use a linear regression model, with cross-validated lasso (glmnet) for feature selection; stepwise selection would be another option.
First, randomly choose a training set and fit the model on it. Then predict profit for the test set and compare with the true profit.
library(glmnet)
FullD_all <- FullD %>% filter(!is.na(profit)) %>%
subset(select=-c(location_id, date, n, year, md))
Cho <- sort(sample(1:nrow(FullD_all), 6000))
FullD_train <- FullD_all[Cho,]
FullD_test <- FullD_all[-Cho,]
FullDM_all <- model.matrix(profit~population+elevation+weekdays+
HolBef+HolAft+
SumWin+temperature+pressure+humidity+
SumWin:toClose+SumWin:aftOpen+
SumWin:temperature+
SumWin:pressure+
SumWin:humidity+
cloudy+precipitation+
SumWin:cloudy+SumWin:precipitation+
tempLev+
prev1Prof+prev5Prof+
prev1Temp+prev5Temp+
DofNoProfBef+prevDRain+DofRainfBef-1,
data = FullD_all)
FullDM_train <- FullDM_all[Cho,]
FullDM_test <- FullDM_all[-Cho,]
fit_cv <- cv.glmnet(x = FullDM_train,y = FullD_train$profit)
fit1 <- glmnet(x = FullDM_train, y = FullD_train$profit,
family = "gaussian", lambda = fit_cv$lambda.min)
Plot true profit vs. predicted profit
plot(FullD_train$profit,
predict(fit1, newx = FullDM_train),
main = "Training data")
lines(x = c(-100, 1000), y = c(-100, 1000), col = "red")
plot(FullD_test$profit,
predict(fit1, newx = FullDM_test),
main = "Testing data")
lines(x = c(-100, 1000), y = c(-100, 1000), col = "red")
True total profit vs. predicted total profit for testing data
sum(FullD_test$profit)
## [1] 441206.6
sum(predict(fit1, newx = FullDM_test))
## [1] 440785.9
I also tried using locations 1 through 9 as the training set: fit the model on them, predict profit at location 10, and compare with the true profit.
FullD_9 <- FullD %>% filter(!is.na(profit),
location_id < 10) %>%
subset(select=-c(location_id, date, n, year, md))
FullD_10 <- FullD %>% filter(!is.na(profit),
location_id == 10) %>%
subset(select=-c(location_id, date, n, year, md))
FullDM_all <- model.matrix(profit~population+elevation+weekdays+
HolBef+HolAft+
SumWin+temperature+pressure+humidity+
SumWin:toClose+SumWin:aftOpen+
SumWin:temperature+
SumWin:pressure+
SumWin:humidity+
cloudy+precipitation+
SumWin:cloudy+SumWin:precipitation+
tempLev+
prev1Prof+prev5Prof+
prev1Temp+prev5Temp+
DofNoProfBef+prevDRain+DofRainfBef-1,
data = rbind(FullD_9,FullD_10))
FullDM_9 <- FullDM_all[1:nrow(FullD_9),]
FullDM_10 <- FullDM_all[-c(1:nrow(FullD_9)),]
fit_cv <- cv.glmnet(x = FullDM_9, y = FullD_9$profit)
fit2 <- glmnet(x = FullDM_9, y = FullD_9$profit,
family = "gaussian", lambda = fit_cv$lambda.min)
Plot true profit vs. predicted profit
plot(FullD_9$profit,
predict(fit2, newx = FullDM_9),
main = "Training data")
lines(x = c(-100, 1000), y = c(-100, 1000), col = "red")
plot(FullD_10$profit,
predict(fit2, newx = FullDM_10),
main = "Testing data")
lines(x = c(-100, 1000), y = c(-100, 1000), col = "red")
True total profit vs. predicted total profit for testing data
sum(FullD_10$profit)
## [1] 217568.4
sum(predict(fit2, newx = FullDM_10))
## [1] 224459.2
Finally, use all data from locations 1 through 10 to fit the model, then predict profit at the 3 candidate locations.
FullD_Know <- FullD %>% filter(!is.na(profit)) %>%
subset(select=-c(location_id, date, n, year, md))
FullD_Pred <- FullD %>% filter(location_id > 10,
date >= as.Date("2022-01-01") &
date<=as.Date("2022-10-30")) %>%
subset(select=-c(location_id, date, n, year, md))
FullDM <- model.matrix(~population+elevation+weekdays+
HolBef+HolAft+
SumWin+temperature+pressure+humidity+
SumWin:temperature+
SumWin:pressure+
SumWin:humidity+
cloudy+precipitation+
SumWin:cloudy+SumWin:precipitation+
tempLev+
prev1Prof+prev5Prof+
prev1Temp+prev5Temp+
DofNoProfBef+prevDRain+DofRainfBef-1,
data = rbind(FullD_Know, FullD_Pred))
FullDM_Know <- FullDM[1:nrow(FullD_Know),]
FullDM_Pred <- FullDM[-c(1:nrow(FullD_Know)),]
fit_cv <- cv.glmnet(x = FullDM_Know, y = FullD_Know$profit)
fit <- glmnet(x = FullDM_Know, y = FullD_Know$profit,
family = "gaussian", lambda = fit_cv$lambda.min)
Pred <- FullD %>% filter(location_id > 10,
date >= as.Date("2022-01-01") &
date<=as.Date("2022-10-30")) %>%
subset(select=c(location_id, date)) %>%
data.frame(predict(fit, newx = FullDM_Pred)) %>%
`colnames<-`(c("location_id", "date", "predProf")) %>%
as_tibble() %>%
left_join(FullD %>% filter(location_id <= 10,
date >= as.Date("2022-01-01") &
date<=as.Date("2022-10-30")) %>%
group_by(date) %>%
summarise(closure = sum(!is.na(profit))/10), by = "date")
PredPlot <- Pred %>% filter(closure != 0) %>%
ggplot(aes(date, predProf,
color = as.factor(location_id))) +
geom_point(alpha = (Pred %>% filter(closure != 0))$closure)
ggplotly(PredPlot)
Pred %>% group_by(location_id) %>%
summarise(predProf = sum(predProf*closure))
## # A tibble: 3 × 2
## location_id predProf
## <dbl> <dbl>
## 1 11 17716.
## 2 12 17294.
## 3 13 16061.
Based on the predicted profits above, the candidate with location_id = 11 has the largest predicted total profit from January 1 through October 30, 2022.